Topic 4: Classifiers

Math 253: Statistical Computing & Machine Learning

Daniel Kaplan

Classification overview

Day 10 preview

  1. Probability and odds
  2. Multivariate gaussians (Maybe)
  3. Programming activity: Poker hands

Logistic regression

Probability and odds

Probability \(p(event)\) is a number between zero and one.

A simple way to make a probability model for a yes/no variable: encode the outcome as zero and one, then use regression.

data(Whickham, package = "mosaicData")  # if not already loaded
Whickham$alive <- as.numeric(with(Whickham, outcome == "Alive"))

Model of mortality in Whickham

res <- mean( alive ~ smoker, data=Whickham)
res
##        No       Yes 
## 0.6857923 0.7611684
res / (1-res)
##       No      Yes 
## 2.182609 3.187050
library(splines)
mod <- lm(alive ~ smoker + age, data = Whickham)
gmodel(mod, ~ age + smoker) %>% gf_jitter(alive ~ age + color:smoker, alpha = .2, data = Whickham)

mod2 <- glm(alive ~ smoker * age, data = Whickham, family="binomial")
summary(mod2)
## 
## Call:
## glm(formula = alive ~ smoker * age, family = "binomial", data = Whickham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3983  -0.4256   0.2163   0.5598   1.9283  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.169231   0.606600  13.467   <2e-16 ***
## smokerYes     -1.457843   0.837232  -1.741   0.0816 .  
## age           -0.133231   0.009953 -13.386   <2e-16 ***
## smokerYes:age  0.022235   0.014495   1.534   0.1250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  942.68  on 1310  degrees of freedom
## AIC: 950.68
## 
## Number of Fisher Scoring iterations: 6
gmodel(mod2, ~ age + smoker)

mod2 <- glm(alive ~ age, data=Whickham, family = "binomial")

gmodel(mod2) %>%
  gf_jitter(alive ~ age, data = Whickham, alpha = .2)

If we’re going to use likelihood to fit, the estimated probability must lie strictly between 0 and 1; a plain linear model can produce fitted values outside that range.

Log Odds

Gerolamo Cardano (1501-1576) defined odds as the ratio of favorable to unfavorable outcomes.

For an event whose probability is \(p\), its odds are \(w = \frac{p}{1-p}\).

A probability is a number between 0 and 1.

Odds are a ratio of two positive numbers: 5:9, 9:5, etc.

“Odds are against it” can be taken to mean that the odds are less than 1: more unfavorable outcomes than favorable ones.

Given odds \(w\), the probability is \(p = \frac{w}{1+w}\). There’s a one-to-one correspondence between probability and odds.

The log odds is a number between \(-\infty\) and \(\infty\).
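A quick numeric illustration in R, starting from the non-smoker survival probability computed above:

p <- 0.6857923       # P(alive) for non-smokers, from the Whickham example
w <- p / (1 - p)     # odds, about 2.18
log(w)               # log odds
w / (1 + w)          # converting odds back to probability recovers p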

Why use odds?

Making Book

Several horses run in a race. People bet an amount \(H_i\) on each horse \(i\).

What should the winnings be when horse \(j\) wins? The payoff means you get your original stake back plus your winnings.

If it’s arranged to pay winnings of \(\frac{\sum_{i \neq j} H_i}{H_j}\) per dollar staked on horse \(j\), plus the stake back, then the total paid out when horse \(j\) wins is \(\sum_{i \neq j} H_i + H_j = \sum_i H_i\), exactly the money taken in, so the net income for the bookie will be zero.
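A small worked example with made-up bet amounts, showing that the zero-net-income payoffs return exactly the money taken in:

H <- c(50, 30, 20)                       # hypothetical bets on three horses
winnings_per_dollar <- (sum(H) - H) / H  # zero-net-income odds against each horse
H * winnings_per_dollar + H              # payoff if that horse wins: always sum(H) = 100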

Shaving the odds means to pay less than the zero-net-income winnings.

Link function

You can build a linear regression to predict the log odds, \(\ln w\). The output of the linear regression is free to range from \(-\infty\) to \(\infty\). Then, to evaluate the likelihood, exponentiate (“unlog”) to get the odds \(w\), and convert to a probability with \(p = \frac{w}{1+w}\).

Use of glm()

Response should be 0 or 1. We don’t take the log odds of the response. Instead, the likelihood of each case is

  - \(p\) if the outcome is 1
  - \(1-p\) if the outcome is 0

Multiply these together over all the cases to get the total likelihood.
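A minimal sketch of this computation, using mod2 and the 0/1 alive variable from above; working on the log scale avoids numerical underflow, and the result can be compared with R’s logLik():

p_hat <- predict(mod2, type = "response")        # fitted probabilities for each case
y <- Whickham$alive                              # observed 0/1 outcomes
sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))   # hand-computed log likelihood
logLik(mod2)                                     # should agree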

Interpretation of coefficients

Each coefficient adds to the log odds in the usual linear-regression way. A negative coefficient means the outcome is less likely; a positive one, more likely.
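Since a coefficient is an additive change in the log odds, exponentiating it gives a multiplicative change in the odds. For instance, with mod2 (the model of alive versus age) from above:

coef(mod2)        # on the log-odds scale
exp(coef(mod2))   # on the odds scale: the factor by which the odds change per unit increase in the input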

Example: Logistic regression of default

data(Default, package = "ISLR")  # if not already loaded
names(Default)
## [1] "default" "student" "balance" "income"
ggplot(Default, 
       aes(x = income, y = balance, alpha = default, color = default)) + 
  geom_point() #+ facet_wrap( ~ student)

model_of_default <-
  glm(default == "Yes" ~ balance + income, data = Default, family = "binomial")
gmodel(model_of_default)

gmodel(model_of_default, ~ income + balance)

summary(model_of_default)
## 
## Call:
## glm(formula = default == "Yes" ~ balance + income, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
logodds <- predict(model_of_default, newdata = list(balance = 1000, income = 40000)) #,
                   # type = "response")
logodds
##         1 
## -5.061006
odds <- exp(logodds)
odds / (1 + odds)
##           1 
## 0.006299244
logistic <- function(x) {exp(x) / (1 + exp(x))}
logistic(-3.36)
## [1] 0.03356922
table(Default$default)
## 
##   No  Yes 
## 9667  333
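The exponentiate-then-convert arithmetic can be skipped: for a glm, predict() returns the probability directly when given type = "response".

predict(model_of_default, newdata = list(balance = 1000, income = 40000),
        type = "response")   # same value as odds / (1 + odds) above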

Linear and Quadratic Discriminant Analysis

Example: Default on student loans

library(MASS)  # for lda() and qda()
model_of_default2 <-
  lda(default ~ balance + income, data = Default)
model_of_default2
## Call:
## lda(default ~ balance + income, data = Default)
## 
## Prior probabilities of groups:
##     No    Yes 
## 0.9667 0.0333 
## 
## Group means:
##       balance   income
## No   803.9438 33566.17
## Yes 1747.8217 32089.15
## 
## Coefficients of linear discriminants:
##                  LD1
## balance 2.230835e-03
## income  7.793355e-06
predict(model_of_default2, newdata = list(balance = 3000, income = 40000))
## $class
## [1] Yes
## Levels: No Yes
## 
## $posterior
##            No       Yes
## 1 0.008136798 0.9918632
## 
## $x
##        LD1
## 1 4.879445
data(CPS85, package = "mosaicData")  # if not already loaded
sector_mod <- lda(sector ~ wage + educ, data = CPS85)
sector_mod
## Call:
## lda(sector ~ wage + educ, data = CPS85)
## 
## Prior probabilities of groups:
##   clerical      const      manag      manuf      other       prof 
## 0.18164794 0.03745318 0.10299625 0.12734082 0.12734082 0.19662921 
##      sales    service 
## 0.07116105 0.15543071 
## 
## Group means:
##               wage     educ
## clerical  7.422577 12.93814
## const     9.502000 11.15000
## manag    12.704000 14.58182
## manuf     8.036029 11.19118
## other     8.500588 11.82353
## prof     11.947429 15.63810
## sales     7.592632 13.21053
## service   6.537470 11.60241
## 
## Coefficients of linear discriminants:
##             LD1        LD2
## wage 0.06196785 -0.2108914
## educ 0.43349567  0.2480535
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9043 0.0957
predict(sector_mod, newdata = list(wage = 10, educ = 16))
## $class
## [1] prof
## Levels: clerical const manag manuf other prof sales service
## 
## $posterior
##    clerical       const     manag      manuf      other      prof
## 1 0.1619905 0.005807399 0.1680084 0.02274195 0.04429026 0.4781032
##        sales    service
## 1 0.07668742 0.04237084
## 
## $x
##        LD1       LD2
## 1 1.352846 0.5336987
all_combos <- expand.grid(wage = seq(0,20,length=100), educ = seq(5,16, length = 100))
res <- predict(sector_mod, newdata = all_combos)$class
all_combos$predicted <- res
ggplot(all_combos, aes(x = wage, y = educ, color = predicted)) + geom_point()

A Bayes’ Rule approach

Suppose we have \(K\) classes, \(A_1, A_2, \ldots, A_K\). We also have a set of inputs \(x_1, x_2, \ldots, x_p \equiv {\mathbf x}\).

We observe \({\mathbf x}\) and we want to know \(p(A_j \,|\, {\mathbf x})\). This is a posterior probability.

Per usual, the quantities we can get from our training data are in the form of a likelihood: \(p({\mathbf x} \,|\, A_j)\).

Given a prior \(p(A_j)\) for all K classes, we can flip the likelihood into a posterior.
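Spelled out, Bayes’ rule for the flip is

\[p(A_j \,|\, {\mathbf x}) = \frac{p({\mathbf x} \,|\, A_j)\, p(A_j)}{\sum_{k=1}^{K} p({\mathbf x} \,|\, A_k)\, p(A_k)}\]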

In order to define our likelihood \(p({\mathbf x} \,|\, A_j)\), we need both the training data and a model form for the probability \(p()\).

A standard (but not necessarily good) model of a distribution is a multivariate Gaussian. LDA and QDA are based on a multivariate Gaussian.

Univariate Gaussian

\[p(x) = \underbrace{\frac{1}{\sqrt{2 \pi \sigma^2}}}_{Normalization} \underbrace{\exp(- \frac{(x-m)^2}{2 \sigma^2})}_{Shape}\]

Imagine that we have another variable \(z\) that is a stretched-out version of \(x\), stretched by a factor of 3 around the mean: \(z = m + 3(x - m)\). The distribution is

\[p(z) = \underbrace{\frac{1}{\sqrt{2 \pi (3\sigma)^2}}}_{Normalization}\ \underbrace{\exp\left(- \frac{(z-m)^2}{2 (3\sigma)^2}\right)}_{Shape}\]

Note how the normalization changes. \(p(z)\) is broader than \(p(x)\), so it must also be shorter.

The R function dnorm() calculates \(p(x)\) for a univariate Gaussian.
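A quick check that dnorm() matches the formula above; the values of m, sigma, and x below are arbitrary choices for illustration.

m <- 1; sigma <- 2; x <- 0.3
1 / sqrt(2 * pi * sigma^2) * exp(-(x - m)^2 / (2 * sigma^2))
dnorm(x, mean = m, sd = sigma)   # same value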

Uncorrelated bivariate gaussian

For independent RVs \(x\) and \(y\), \(p(x, y) = p(x)\,p(y)\). Show that the normalization is \(\frac{1}{2 \pi \sigma_x \sigma_y}\).

The sigmas multiply in the normalization, like the area of something being stretched out in two orthogonal directions.

Bivariate normal distribution with correlations

\[f(x,y) = \frac{1}{2 \pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)}\left[ \frac{(x-\mu_X)^2}{\sigma_X^2} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} \right] \right)\]

If \(\rho > 0\) and \(x\) and \(y\) are both above their respective means, the correlation term makes the result less surprising: a larger probability.

Another way of writing this same formula uses a covariance matrix \({\boldsymbol\Sigma}\).

Or, in matrix form

\[(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu) \right)\]

where \[\boldsymbol \Sigma \equiv \left(\begin{array}{cc}\sigma_x^2 & \rho \sigma_x \sigma_y\\\rho\sigma_x\sigma_y& \sigma_y^2\end{array} \right)\]

Therefore \[\boldsymbol \Sigma^{-1} \equiv \frac{1}{\sigma_x^2 \sigma_y^2 (1 - \rho^2)} \left(\begin{array}{cc}\sigma_y^2 & - \rho \sigma_x \sigma_y\\ - \rho\sigma_x\sigma_y& \sigma_x^2\end{array} \right)\]
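A minimal sketch of evaluating the matrix form in R; the parameter values here are arbitrary illustrations, not taken from a data set.

sigma_x <- 3; sigma_y <- 1; rho <- 0.5
mu <- c(0, 0)
Sigma <- matrix(c(sigma_x^2, rho * sigma_x * sigma_y,
                  rho * sigma_x * sigma_y, sigma_y^2), nrow = 2)
x <- c(1, 0.5); k <- 2
(2 * pi)^(-k/2) * det(Sigma)^(-1/2) *
  exp(-0.5 * t(x - mu) %*% solve(Sigma) %*% (x - mu))
# if the mvtnorm package is installed, mvtnorm::dmvnorm(x, mean = mu, sigma = Sigma)
# gives the same value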

Shape of multivariate gaussian

As an amplitude plot (figure: amplitude plot).

Showing marginals and the 3-\(\sigma\) contour (figure: marginals).

Generating bivariate normal from independent

We want to find a matrix, M, by which to multiply iid Z to get correlated X with specified \(\sigma_x, \sigma_y, \rho\). The covariance matrix will be

# parameters
sigma_x <- 3
sigma_y <- 1
rho <- 0.5
Sigma <- matrix(c(sigma_x^2, rho * sigma_x * sigma_y, rho * sigma_x * sigma_y, sigma_y^2), nrow = 2)
n <- 5000 # number of simulated cases
# iid base
Z <- cbind(rnorm(n), rnorm(n))
M <- matrix(c(sigma_x, 0, rho * sigma_y, sqrt(1 - rho^2) * sigma_y), nrow = 2)  # upper-triangular factor: t(M) %*% M equals Sigma
X <- Z %*% M
cov(X)
##          [,1]      [,2]
## [1,] 8.960097 1.4829914
## [2,] 1.482991 0.9803776

M transforms from iid to correlated.

In the density formula, we transform from correlated X back to iid, so we use M\(^{-1}\).
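A quick check with the X and M simulated above: multiplying by M\(^{-1}\) recovers (approximately) uncorrelated, unit-variance columns.

Z_back <- X %*% solve(M)
cov(Z_back)    # close to the 2 x 2 identity matrix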

Independent variables \(x_i\)

x1 <- runif(1000)
x2 <- rnorm(1000, mean = 3 * x1 + 2, sd = x1)
plot(x1, x2)

Remember the univariate Gaussian with parameters \(\mu\) and \(\sigma^2\):
\[\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)\]

Situation: Build a classifier. We measure some features and want to say which group a case belongs to.

Specific example: Based on the ISLR::Default data, find the probability of a person defaulting on a loan given their income and balance.

names(Default)
## [1] "default" "student" "balance" "income"
ggplot(Default, 
       aes(x = income, y = balance, alpha = default, color = default)) + 
  geom_point()

We were looking at the likelihood: prob(observation | class)

Note: Likelihood itself won’t do a very good job, since defaults are relatively uncommon. That is, p(default) \(\ll\) p(not).

Re-explaining \(\boldsymbol\Sigma\)

Figure 4.8 from ISL

Imagine two variables, \(x_i\) and \(x_j\), e.g. education and age, and define the zero-mean versions \(v_i = x_i - \mu_i\) and \(v_j = x_j - \mu_j\). I’ll write these in data table format, where each column is a variable and each row is a case, and denote this by

\(\left(\begin{array}{c}v_i\\\downarrow\end{array}\right)\) and \(\left(\begin{array}{c}v_j\\\downarrow\end{array}\right)\)

Correlations between random variables \((v_i, v_j)\) are created by overlapping sums of zero-mean iid random variables \((z_i, z_j)\),

\[\left(\begin{array}{cc}v_i &v_j\\\downarrow & \downarrow\end{array}\right) = \left(\begin{array}{cc}z_i & z_j \\ \downarrow & \downarrow\end{array}\right) \left(\begin{array}{cc}a & b\\0 & c\end{array}\right) \equiv \left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) {\mathbf A} \]

and add in a possibly non-zero mean to form each \(x\). \[\left(\begin{array}{cc}x_i &x_j\\\downarrow&\downarrow\end{array}\right) = \left(\begin{array}{cc}v_i & v_j\\\downarrow&\downarrow\end{array}\right) + \left(\begin{array}{cc}m_i & m_j\\\downarrow&\downarrow\end{array}\right) \] where each of \(\left(\begin{array}{c}m_i\\\downarrow\end{array}\right)\) and \(\left(\begin{array}{c}m_j\\\downarrow\end{array}\right)\) has every row the same.

The covariance matrix \(\boldsymbol\Sigma\) is

\[{\boldsymbol \Sigma} \equiv \frac{1}{n} \left(\begin{array}{cc}v_i & v_j\\\downarrow&\downarrow\end{array}\right)^T \left(\begin{array}{cc}v_i & v_j\\\downarrow&\downarrow\end{array}\right) = \frac{1}{n} \left(\begin{array}{c}v_i \longrightarrow\\v_j \longrightarrow\end{array}\right) \left(\begin{array}{cc}v_i & v_j\\\downarrow&\downarrow\end{array}\right) \]

Substituting in

\[ \left(\begin{array}{cc}v_i &v_j\\\downarrow & \downarrow\end{array}\right) = \left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) {\mathbf A} \] we get

\[{\boldsymbol \Sigma} = \frac{1}{n} \left[\left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) {\mathbf A} \right]^T \left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) {\mathbf A} = \frac{1}{n} {\mathbf A}^T \left(\begin{array}{c}z_i \longrightarrow\\z_j \longrightarrow\end{array}\right) \left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) {\mathbf A}\]

Now the columns of \(\left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right)\) are iid with zero mean and unit variance, so \[\frac{1}{n}\left(\begin{array}{c}z_i \longrightarrow\\z_j \longrightarrow\end{array}\right) \left(\begin{array}{cc}z_i & z_j\\\downarrow & \downarrow\end{array}\right) \approx \left(\begin{array}{cc}1 & 0\\0 & 1\end{array}\right)\] so \({\boldsymbol \Sigma} = {\boldsymbol A}^T {\boldsymbol A}\).

In other words, \(\boldsymbol A\) is the Cholesky decomposition of \(\boldsymbol \Sigma\).

Operationalizing this in R

library(dplyr)
Sigma <- cov(ISLR::Default %>% dplyr::select(balance, income))
Sigma
##           balance      income
## balance  233980.2   -982142.3
## income  -982142.3 177865954.8
A <- chol(Sigma)
A
##         balance    income
## balance 483.715 -2030.415
## income    0.000 13181.175
n <- 100 # say
Z <- cbind(rnorm(n), rnorm(n))
V <- Z %*% A
M <- cbind(rep(3, n), rep(-2, n))
head(M)
##      [,1] [,2]
## [1,]    3   -2
## [2,]    3   -2
## [3,]    3   -2
## [4,]    3   -2
## [5,]    3   -2
## [6,]    3   -2
X <- V + M
cov(X)
##            balance    income
## balance   221792.1  -1651907
## income  -1651907.3 140496689

Why isn’t this exactly the same as the covariance matrix \(\boldsymbol \Sigma\) that we were aiming at? Because of random fluctuations in the \(\left(\begin{array}{cc}z_i&z_j\\\downarrow&\downarrow\end{array}\right)\). You can reduce the impact of those fluctuations by making \(n\) bigger.

Notice that \({\boldsymbol A}\) transforms from uncorrelated \(\left(\begin{array}{cc}z_i&z_j\\\downarrow&\downarrow\end{array}\right)\) to correlated \(\left(\begin{array}{cc}v_i&v_j\\\downarrow&\downarrow\end{array}\right)\). If we have \(\left(\begin{array}{cc}x_i&x_j\\\downarrow&\downarrow\end{array}\right)\), we can create the uncorrelated \(\left(\begin{array}{cc}z_i&z_j\\\downarrow&\downarrow\end{array}\right)\) with \[\left(\begin{array}{cc}z_i&z_j\\\downarrow&\downarrow\end{array}\right) = \left[\left(\begin{array}{cc}x_i&x_j\\\downarrow&\downarrow\end{array}\right) - \left(\begin{array}{cc}m_i&m_j\\\downarrow&\downarrow\end{array}\right)\right] {\boldsymbol A}^{-1}\].
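Checking this claim numerically with the X, M, and A created above:

Z_recovered <- (X - M) %*% solve(A)
cov(Z_recovered)    # approximately the identity matrix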

Recall \[(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp\left( -\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu) \right)\].

Since \(\boldsymbol\Sigma = \boldsymbol A^T \boldsymbol A\) and \(x_i - m_i = v_i\), this formula is equivalent to \[(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp\left( -\frac{1}{2}\, \mathbf{v}^{T}\boldsymbol A^{-1} \boldsymbol A^{-T} \mathbf v \right) = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp\left( -\frac{1}{2}\, \mathbf{z}^{T}\mathbf z \right)\] Now, for a pair of values \(\mathbf x = (x_1 \ \ x_2)\), finding the probability of \(\mathbf x\) corresponds to finding the corresponding \(\mathbf z^T = (z_1\ \ z_2)\), where \(z_1\) and \(z_2\) are each random scalars, and \(\mathbf{z}^{T}\mathbf z = z_1^2 + z_2^2\), so the probability is

\[(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp\left( -\frac{1}{2}(z_1^2 + z_2^2 )\right) = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma|^{-\frac{1}{2}}\, \exp(-\frac{z_1^2}{2}) \exp(-\frac{z_2^2}{2})\]

Look at the stretching that goes on due to a matrix:

(figure: the stretching produced by a matrix transformation)

The stretching is due to the matrix \(\boldsymbol A\). So we should divide by the determinant of \(\boldsymbol A\), that is, \(|\boldsymbol A|\). The nature of the Cholesky decomposition is that \(|\boldsymbol A| = \sqrt{|\boldsymbol\Sigma|}\). Note in the formula for the Gaussian that the normalizing constant involves \(\sqrt{|\boldsymbol\Sigma|}\).
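A quick numerical check with the Sigma and A computed above:

det(A)            # determinant of the Cholesky factor
sqrt(det(Sigma))  # the same number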

LDA

All classes are treated as having the same \({\mathbf \Sigma}\).

QDA

Classes are treated with different \({\mathbf \Sigma}_i\).

knitr::include_graphics("Images/Chapter-4/4.9.png")

Figure 4.9 from ISL. Left: Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with \({\mathbf \Sigma}_1 = {\mathbf \Sigma}_2\). Right: QDA.

Error test rates on various classifiers

Figure 4.10 from ISL

Scenarios: In all of them, the class means differ.

  1. Each class is two uncorrelated Gaussian random variables.
  2. Both classes have a correlation of \(-0.5\) between the variables.
  3. Uncorrelated, like (1), but the distribution is t(df=?): long tailed to the right.

Figure 4.11 from ISL
  4. Like (2), but one class has \(\rho = 0.5\) and the other \(\rho = -0.5\).
  5. A nonlinear predictor with \(X_1^2\), \(X_2^2\), and \(X_1 \times X_2\), giving a quadratic decision boundary.
  6. A decision boundary more complicated than a quadratic.

Error rates

Ways to measure the performance of a classifier.

Examples: Two classifiers of employment type.

data(CPS85, package = "mosaicData")
classifier_1 <- lda(sector ~ wage + educ + age, data = CPS85)
classifier_2 <- qda(sector ~ wage + educ + age, data = CPS85)
table(CPS85$sector, predict(classifier_1)$class)
##           
##            clerical const manag manuf other prof sales service
##   clerical       60     0     0     1    17   17     0       2
##   const           7     0     2     4     5    0     0       2
##   manag          15     0     5     1     3   31     0       0
##   manuf          22     0     4    13    14    7     0       8
##   other          21     0     2     6    24    5     0      10
##   prof           15     0     2     0     6   81     0       1
##   sales          25     0     1     0     2    8     0       2
##   service        38     0     1     9    12    7     0      16
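The diagonal of the confusion matrix holds the correctly classified cases, so the overall error rate is one minus the proportion on the diagonal:

confusion <- table(CPS85$sector, predict(classifier_1)$class)
sum(diag(confusion)) / sum(confusion)      # proportion correctly classified
1 - sum(diag(confusion)) / sum(confusion)  # error rate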

Receiver operating curves

There is always at least one parameter that can be set in a classifier; it might be as simple as the prior probability.

As this parameter is changed, typically sensitivity will go up at the cost of specificity, or vice versa.
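A minimal sketch of this trade-off using the logistic model of default from earlier; here the tuning parameter is the probability threshold for declaring a “Yes”, and the threshold values below are arbitrary choices for illustration.

p_hat <- predict(model_of_default, type = "response")
actual <- Default$default == "Yes"
thresholds <- c(0.05, 0.1, 0.2, 0.5)
data.frame(
  threshold   = thresholds,
  sensitivity = sapply(thresholds, function(t) mean(p_hat[actual] > t)),   # true-positive rate
  specificity = sapply(thresholds, function(t) mean(p_hat[!actual] <= t))  # true-negative rate
)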

President Garfield’s assassination.

Example: Blood Cells

library(ggplot2)
library(MASS)
Cells %>%
  ggplot(aes(x = X, y = Y)) + geom_point(aes(color = Type))

mod <- qda(Type ~ X + Y, data = Cells)
Grid <- expand.grid(X = seq(2590, 66000, length = 151), Y = seq(11000,65000, length = 151))
preds_in <- predict(mod)
preds <- predict(mod, newdata = Grid)
Grid$class <- preds$class
Grid %>% 
  ggplot(aes(x = X, y = Y)) + geom_point(aes(color=class), alpha = .1) + 
  geom_point(data = Cells, aes(x = X, y = Y, color = Type), alpha = .1)

table(Cells$Type, preds_in$class)
##    
##         0     1     2     3     5
##   0  5077    80     0     0   135
##   1    87  2519   238     2    12
##   2     0   292 23927   178     6
##   3     0     0   291   554    10
##   5   295     4     0    11  1815